Library Imports
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.sql import functions as F
from datetime import datetime
from decimal import Decimal
Template
spark = (
    SparkSession.builder
    .master("local")
    .appName("Section 2.6 - Filtering Data")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)
sc = spark.sparkContext
import os
data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path
pets = spark.read.csv(path, header=True)
pets.toPandas()
|    | id | breed_id | nickname | birthday | age | color |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown | 
| 1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None | 
| 2 | 3 | 1 | Chewie | 2016-11-22 10:05:10 | 15 | None | 
Filtering Data
Filtering out unwanted rows is another commonly used operation in data analysis.
Option 1 - where()
(
    pets
    .where(F.col('breed_id') == 1)
    .filter(F.col('color') == 'brown')
    .toPandas()
)
|    | id | breed_id | nickname | birthday | age | color |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown | 
What Happened?
Similar to the functions we have seen so far, there are multiple functions aliased to different names that perform the same transformation; `where()` and `filter()` are interchangeable. IMO I prefer `where()` as it's a bit more intuitive and closer to the SQL syntax.
Note:
Notice how we don't have to wrap `1` or `'brown'` in `F.lit()`; because the left-hand side is already a column expression, the bare Python values are automatically treated as literal columns.
We will look at how to express more complex filters that combine more than one condition in 2.1.7.
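As a quick illustration of both notes, here is a minimal sketch (the variable names are just for illustration) showing that `filter()` is an alias of `where()`, and that wrapping the literals explicitly in `F.lit()` gives the same result:

# Minimal sketch: all three produce the same rows on the `pets` DataFrame
# loaded above, since filter() is an alias of where() and bare Python
# values are wrapped into literal columns for you.
brown_breed_1_a = pets.where(F.col('breed_id') == 1).filter(F.col('color') == 'brown')
brown_breed_1_b = pets.filter(F.col('breed_id') == 1).where(F.col('color') == 'brown')
brown_breed_1_c = pets.where(F.col('breed_id') == F.lit(1)).where(F.col('color') == F.lit('brown'))

assert brown_breed_1_a.count() == brown_breed_1_b.count() == brown_breed_1_c.count()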
Option 2 - isin()
(
    pets
    .where(F.col('nickname').isin('King', 'Argus'))
    .toPandas()
)
|    | id | breed_id | nickname | birthday | age | color |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | King | 2014-11-22 12:30:31 | 5 | brown | 
| 1 | 2 | 3 | Argus | 2016-11-22 10:05:10 | 10 | None | 
What Happened?
If you want to check whether a column's value is one of many possible values, you can use the `isin()` function. This function accepts either comma-separated values or a single list of values. This is again very similar to the SQL syntax.
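To illustrate, here is a minimal sketch (the variable names are just for illustration) of the two equivalent ways of calling `isin()`, with comma-separated values and with a single list:

# Minimal sketch: isin() accepts either comma-separated values or a list,
# so both filters keep the same rows.
by_varargs = pets.where(F.col('nickname').isin('King', 'Argus'))
by_list = pets.where(F.col('nickname').isin(['King', 'Argus']))

assert by_varargs.count() == by_list.count()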
Summary
- We learnt about two filtering functions in Spark, `where()` and `isin()`.
- Using `isin()` you can check whether a column contains one of multiple values.
- These functions are named similarly to their SQL counterparts.